library(mosaic)
library(readr)
library(bestglm)
library(Stat2Data)
library(MASS)
# Inverse-logit helper: maps the linear predictor B0 + B1*x to a probability
logit = function(B0, B1, x)
{
  exp(B0 + B1*x) / (1 + exp(B0 + B1*x))
}
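As a quick sanity check of the helper, an arbitrary (not fitted) set of values can be pushed through it; the definition is repeated in the chunk so it runs on its own:

```r
# Inverse-logit helper repeated so this chunk is standalone
logit = function(B0, B1, x) exp(B0 + B1*x) / (1 + exp(B0 + B1*x))
# Arbitrary illustration values: B0 = 0, B1 = 0.1, x = 50 gives a linear
# predictor of 5, which the inverse-logit maps to roughly 0.993
logit(0, 0.1, 50)
```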
NBA_Data = read_csv("/Users/reidbrown/Documents/Senior/Spring 2020/STOR 455/Homework/Data For HW6/nba.games.stats.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Team = col_character(),
## Date = col_date(format = ""),
## Home = col_character(),
## Opponent = col_character(),
## WINorLOSS = col_character()
## )
## See spec(...) for full column specifications.
NBA_Data
Hornets_Data = NBA_Data[NBA_Data$Team =="CHO",]
Hornets_Data
Data Prep
I found this data set on Kaggle.com. It is called “NBA Team Game Stats from 2014 to 2018” and it was collected by Ionas Kelepouris. It was last updated two years ago, so the data is not current, but it is still recent. I want to see how different variables predict the Charlotte Hornets’ likelihood of winning games. I downloaded the data, read it into R, and sliced out just the Hornets (abbreviated CHO). This assignment is an extension of the work done in Homework 6, now in a multiple logistic regression setting.
#Use If/Else Statement to recode as a dummy variable
Hornets_Data$Win = ifelse(Hornets_Data$WINorLOSS == "W",1,0)
head(Hornets_Data)
Hornets_Data.1 = within(Hornets_Data,{WINorLOSS=NULL})
head(Hornets_Data.1)
Part A
#Make Home/Away binary. Home=1, Away=0
Hornets_Data.1$HomeCourt = ifelse(Hornets_Data$Home == "Home",1,0)
head(Hornets_Data.1)
Hornets_Data.2 = within(Hornets_Data.1,{Home=NULL})
head(Hornets_Data.2)
#Construct a model using at least three predictors and the same response variable as assignment #6 (also including the two predictors used in assignment #6 in your single-predictor models). This does not need to be the “best” model, as was found with various model selection methods in ordinary regression
HornetMod_glm=glm(Win~TeamPoints+OpponentPoints+HomeCourt+TotalRebounds,family=binomial,data=Hornets_Data.2)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
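These warnings are not random noise: since an NBA game cannot end in a tie, Win equals 1 exactly when TeamPoints > OpponentPoints, so the two scoring variables perfectly separate wins from losses and the maximum-likelihood coefficient estimates diverge. A toy illustration (hypothetical scores, not the Hornets data) reproduces the behavior:

```r
# Hypothetical scores (not the real data) where the response is exactly
# determined by the predictors -- the definition of complete separation
toy = data.frame(TeamPoints     = c(101, 95, 110, 88, 120, 99),
                 OpponentPoints = c( 97, 99, 102, 90, 111, 98))
toy$Win = as.numeric(toy$TeamPoints > toy$OpponentPoints)  # no ties possible
# Fitting drives the fitted probabilities toward 0 and 1, producing the same
# glm.fit warnings seen above, and the residual deviance collapses toward 0
fit = glm(Win ~ TeamPoints + OpponentPoints, family = binomial, data = toy)
fit$deviance  # essentially zero: the model "predicts" every game exactly
```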
Part B
Null Hypothesis: All of the beta coefficients in this model are equal to 0.
Alternative Hypothesis: At least one of the beta coefficients in this model is not equal to 0.
#Compute the G-statistic and use it to test the effectiveness of your model. Include hypotheses and a conclusion
G = HornetMod_glm$null.deviance - HornetMod_glm$deviance
Gdf = HornetMod_glm$df.null - HornetMod_glm$df.residual
1-pchisq(G, Gdf)
## [1] 0
The code above computes the G-statistic (the drop in deviance from the null model to the fitted model) and its p-value; the printed 0 is the p-value, not G itself. Because the p-value from comparing the G-statistic to the Chi^2 distribution is essentially 0, we can reject the null hypothesis and conclude that at least one of the coefficients in our model is not equal to 0. A p-value this small tells us that our model fits the data far better than the intercept-only model.
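The arithmetic can be checked by hand from the deviances reported in the model summary (null deviance 453.23 on 327 df, residual deviance near 0 on 323 df). One numerical caveat worth knowing: for a G this large, `1 - pchisq(...)` underflows to exactly 0, while `lower.tail = FALSE` preserves the tiny tail probability:

```r
# Worked check using the deviances reported by summary(HornetMod_glm):
# null deviance 453.23 on 327 df, residual deviance ~1e-7 on 323 df
G  = 453.23 - 0
df = 327 - 323
1 - pchisq(G, df)                  # underflows to exactly 0, as printed above
pchisq(G, df, lower.tail = FALSE)  # same test; keeps the tiny nonzero p-value
```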
Part C
Null Hypothesis (for each predictor): the coefficient on that predictor equals 0, given the other predictors in the model.
Alternative Hypothesis: the coefficient on that predictor is not equal to 0.
#Test the effectiveness of each predictor in the model. Include hypotheses and conclusions
summary(HornetMod_glm)
##
## Call:
## glm(formula = Win ~ TeamPoints + OpponentPoints + HomeCourt +
## TotalRebounds, family = binomial, data = Hornets_Data.2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.227e-04 -2.100e-08 -2.100e-08 2.100e-08 1.160e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0638 30980.9086 0.000 1.000
## TeamPoints 19.0568 2674.4546 0.007 0.994
## OpponentPoints -19.0396 2664.3930 -0.007 0.994
## HomeCourt 0.3921 5532.6326 0.000 1.000
## TotalRebounds -0.0212 512.8075 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4.5323e+02 on 327 degrees of freedom
## Residual deviance: 1.0398e-07 on 323 degrees of freedom
## AIC: 10
##
## Number of Fisher Scoring iterations: 25
Based on the above test, we cannot reject the null hypothesis for any of the predictors in our model; all of the p-values are much larger than 0.05. Taken at face value, none of the individual predictors have an effect on whether or not the Hornets win a game. This is interesting because the Likelihood Ratio Test done in Part B told us that the model as a whole predicted Wins very well. The enormous standard errors, together with the earlier convergence warnings, explain the contradiction: the fit suffers from complete separation (Win is determined exactly by TeamPoints exceeding OpponentPoints), which inflates the Wald standard errors and makes these individual z-tests unreliable.
Part D
#Recreate the single-predictor models from the last homework. Also fit a model with both predictors TeamPoints and OpponentPoints, to see whether HomeCourt and TotalRebounds (the predictors added at the beginning of this homework) are significant on top of them.
Hornet_Mod_PartA=glm(Win~TeamPoints,family=binomial,data=Hornets_Data.2)
Hornet_Mod_PartB=glm(Win~OpponentPoints,family=binomial,data=Hornets_Data.2)
Hornet_Mod_AandB=glm(Win~TeamPoints+OpponentPoints,family=binomial,data=Hornets_Data.2)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Compare the residual degrees of freedom across models (the deviance reported by summary() is the residual deviance)
Hornet_Mod_PartA$df.residual #reduced
## [1] 326
Hornet_Mod_PartB$df.residual#reduced
## [1] 326
Hornet_Mod_AandB$df.residual#reduced
## [1] 325
HornetMod_glm$df.residual #full
## [1] 323
#Each single-predictor model differs from the full model by 3 df; the two-predictor model differs by 2 df
Null Hypothesis: The addition of the new variables in the full model does not create a significant change in the model’s predicting ability
Alternative Hypothesis: The addition of the new variables in the full model does create a significant change in the model’s predicting ability
#Test the effectiveness of this model with multiple predictors compared to each of your two models with a single predictor. Are there significant improvements? Include hypotheses and a conclusion
1 - pchisq(summary(Hornet_Mod_PartA)$deviance - summary(HornetMod_glm)$deviance, 3)
## [1] 0
1 - pchisq(summary(Hornet_Mod_PartB)$deviance - summary(HornetMod_glm)$deviance, 3)
## [1] 0
1 - pchisq(summary(Hornet_Mod_AandB)$deviance - summary(HornetMod_glm)$deviance, 2)
## [1] 1
#Equivalently, compare nested models with anova() using a chi-square test
anova(Hornet_Mod_PartA, HornetMod_glm, test="Chisq")
anova(Hornet_Mod_PartB, HornetMod_glm, test="Chisq")
anova(Hornet_Mod_AandB, HornetMod_glm, test="Chisq")
#Compare each single-predictor model to the combined model containing both of the original predictors
anova(Hornet_Mod_PartA, Hornet_Mod_AandB, test="Chisq")
anova(Hornet_Mod_PartB, Hornet_Mod_AandB, test="Chisq")
Based on the above tests, the full model (HornetMod_glm) is a significant improvement over both single-predictor models (Hornet_Mod_PartA and Hornet_Mod_PartB), but not over the new model with both of the original predictors (Hornet_Mod_AandB). I made this new model to see whether the added predictors (HomeCourt and TotalRebounds) were significant. The full model beats the single-predictor models only because it contains both TeamPoints and OpponentPoints as predictors; the p-values do not show that adding HomeCourt and TotalRebounds improves the model.
Part E
#Use the bestglm function to determine the best model to predict your response with your given set of predictors. If your dataset has many variables, you can choose a subset of those variables to use in this model selection procedure
#Select the 13 predictors in columns 6-18 (starting from TeamPoints) plus Win (col 40), placed last because bestglm requires the response to be the final column
Hornets.forbestglm = Hornets_Data.2[,c(6:18,40)]
Hornets.forbestglm = as.data.frame(Hornets.forbestglm)
head(Hornets.forbestglm)
#bestglm() only works with <15 predictors
Final_Hornet_Mod = bestglm(Hornets.forbestglm, family=binomial, maxit=100)
Final_Hornet_Mod
## BIC
## BICq equivalent for q in (0, 0.947673485393461)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6037367 553099.25 -1.091552e-06 0.9999991
## TeamPoints 25.0226828 52036.13 4.808713e-04 0.9996163
## OpponentPoints -25.0165409 52015.54 -4.809436e-04 0.9996163